Restricted Boltzmann Machines

Theory of RBMs and Applications

Author

Jessica Wells and Jason Gerstenberger (Advisor: Dr. Cohen)

Published

April 4, 2025

Introduction

Background

Restricted Boltzmann Machines (RBMs) are a type of neural network that has been around since the 1980s. As a reminder to the reader, machine learning is generally divided into three categories: supervised learning (examples: classification tasks, regression), unsupervised learning (examples: clustering, dimensionality reduction, generative modeling), and reinforcement learning (examples: gaming/robotics). RBMs are primarily used for unsupervised learning tasks like dimensionality reduction and feature extraction, which help prepare datasets for machine learning models that may later be trained using supervised learning. They also have other applications, which will be discussed later.

Like Hopfield networks, Boltzmann machines are undirected graphical models, but they differ in that they are stochastic and can have hidden units. Both models are energy-based, meaning they learn by minimizing an energy function (Smolensky et al. 1986). Boltzmann machines use a sigmoid activation function, which makes the model probabilistic.

In the “Restricted” Boltzmann Machine, there are no interactions between neurons in the visible layer or between neurons in the hidden layer, creating a bipartite graph of neurons. Below is a diagram taken from Goodfellow, et al. (Goodfellow, Bengio, and Courville 2016) (p. 577) for visualization of the connections.

Code
reticulate::py_config()
python:         /Users/jessicawells/.virtualenvs/r-reticulate/bin/python
libpython:      /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/config-3.12-darwin/libpython3.12.dylib
pythonhome:     /Users/jessicawells/.virtualenvs/r-reticulate:/Users/jessicawells/.virtualenvs/r-reticulate
version:        3.12.6 (v3.12.6:a4a2d2b0d85, Sep  6 2024, 16:08:03) [Clang 13.0.0 (clang-1300.0.29.30)]
numpy:          /Users/jessicawells/.virtualenvs/r-reticulate/lib/python3.12/site-packages/numpy
numpy_version:  1.26.1


Goodfellow, et al. discuss the expense of drawing samples from most undirected graphical models; the RBM, however, allows for block Gibbs sampling (p. 578), where the network alternates between sampling all hidden units simultaneously and sampling all visible units simultaneously. Derivatives are also simplified by the fact that the energy function of the RBM is a linear function of its parameters, which will be seen further in Methods.
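As an illustration, one step of block Gibbs sampling can be sketched in a few lines of NumPy. The toy sizes (6 visible, 4 hidden units) and the parameter names W, b, c are our own placeholders, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy RBM parameters: 6 visible units, 4 hidden units (illustrative sizes)
W = rng.normal(scale=0.1, size=(6, 4))  # weights w_ij
b = np.zeros(6)                         # visible biases
c = np.zeros(4)                         # hidden biases

v = rng.integers(0, 2, size=6).astype(float)  # a binary visible vector

# Because the graph is bipartite, all hidden units are conditionally
# independent given v, so they can be sampled in one block...
p_h = sigmoid(v @ W + c)
h = (rng.random(4) < p_h).astype(float)

# ...and likewise all visible units can be sampled in one block given h.
p_v = sigmoid(h @ W.T + b)
v_new = (rng.random(6) < p_v).astype(float)
```

Alternating these two block updates is exactly the sampling scheme the RBM's structure makes cheap.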

RBMs are trained using a process called Contrastive Divergence (CD) (G. E. Hinton 2002) where the weights are updated to minimize the difference between samples from the data and samples from the model. Learning rate, batch size, and number of hidden units are all hyperparameters that can affect the ability of the training to converge successfully and learn the underlying structure of the data.

Applications

RBMs are probably best known for their success in collaborative filtering. The RBM model was used in the Netflix Prize competition to predict user ratings for movies, with the result that it outperformed the Singular Value Decomposition (SVD) method that was state-of-the-art at the time (Salakhutdinov, Mnih, and Hinton 2007). They have also been trained to recognize handwritten digits, such as the MNIST dataset (G. E. Hinton 2002).

RBMs have been successfully used to distinguish normal from anomalous network traffic, which is promising for improving network security. Progress in network anomaly detection is slow because datasets for training and testing are difficult to obtain: clients are often reluctant to divulge information that could potentially harm their networks. In a real-life dataset where one host had normal traffic and another was infected by a bot, a discriminative RBM (DRBM) was able to successfully distinguish the normal from the anomalous traffic. The DRBM does not require knowing the data distribution ahead of time, which is useful, but this also makes it prone to overfitting: when the same trained model was applied to the KDD ’99 training dataset, performance declined. (Fiore et al. 2013)

RBMs can provide greatly improved classification of brain disorders in MRI images. Generative Adversarial Networks (GANs) use two neural networks: a generator which generates fake data, and a discriminator which tries to distinguish between real and fake data. Loss from the discriminator is backpropagated through the generator so that both parts are trained simultaneously. The RBM-GAN uses RBM features from real MRI images as inputs to the generator. Features from the discriminator are then used as inputs to a classifier. (Aslan, Dogan, and Koca 2023)

The many-body quantum wavefunction, which describes the quantum state of a system of particles, is difficult to compute with classical computers. RBMs have been used to approximate it using variational Monte Carlo methods. (Melko et al. 2019)

RBMs are notoriously slow to train. Computing the activation probabilities requires calculating vector dot products. Lean Contrastive Divergence (LCD) is a method that adds two techniques to speed up RBM training. The first is bounds-based filtering, in which upper and lower bounds on the activation probability are used so that only a subset of the dot products needs to be computed. The second, the delta product, recalculates only the changed portions of the vector dot product. (Ning, Pittman, and Shen 2018)

Methods

Below is the energy function of the RBM.

\[ E(v,h) = - \sum_{i} a_i v_i - \sum_{j} b_j h_j - \sum_{i} \sum_{j} v_i w_{i,j} h_j \tag{1}\] where \(v_i\) and \(h_j\) represent the visible and hidden units; \(a_i\) and \(b_j\) are the bias terms of the visible and hidden units; and each weight \(w_{i,j}\) represents the interaction between visible unit \(i\) and hidden unit \(j\). (Fischer and Igel 2012)

It is well known that neural networks are prone to overfitting, and techniques such as early stopping are often employed to prevent it. Some methods to prevent overfitting in RBMs are weight decay (L2 regularization), dropout, dropconnect, and weight uncertainty (Zhang et al. 2018). Dropout is a fairly well known concept in deep learning: for example, a dropout value of 0.3 added to a layer means 30% of that layer’s neurons are randomly dropped during training, which prevents the network from relying too heavily on particular features. L2 regularization is also a commonly employed technique in deep learning; it assigns a penalty to large weights to allow for better generalization. Dropconnect randomly sets a subset of the weights within the network to zero during training. Weight uncertainty gives each weight in the network its own probability distribution instead of a fixed value; this addition allows the network to learn more useful features.
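Two of these techniques, dropout and L2 regularization, can be sketched in a few lines of PyTorch. The layer sizes and hyperparameter values below are illustrative only, not tuned for any of our models:

```python
import torch
import torch.nn as nn

# Illustrative network: nn.Dropout(0.3) randomly zeroes 30% of the
# previous layer's activations during training
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(256, 10),
)

# L2 regularization (weight decay): the optimizer adds a penalty
# proportional to the squared weights via weight_decay
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x = torch.randn(32, 784)   # a random stand-in batch of 32 flattened images

model.train()
out_train = model(x)       # dropout is active in training mode

model.eval()
with torch.no_grad():
    out_eval = model(x)    # dropout is disabled at evaluation time
```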

If the learning rate is too high, training of the model may not converge. If it is too low, training may take a long time. To get the most out of training, it is helpful to reduce the learning rate over time; this is known as learning rate decay. (G. Hinton 2010)
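Learning rate decay is typically implemented with an off-the-shelf scheduler. The sketch below uses PyTorch's ExponentialLR with an arbitrary decay factor of 0.9 per epoch; the starting rate and epoch count are illustrative only:

```python
import torch

# A single dummy parameter so the optimizer has something to manage
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.SGD(params, lr=0.1)

# Exponential decay: multiply the learning rate by gamma after each epoch
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

lrs = []
for epoch in range(5):
    optimizer.step()       # (the training loop for the epoch would go here)
    scheduler.step()       # decay the learning rate at the end of the epoch
    lrs.append(optimizer.param_groups[0]["lr"])
```

After each epoch the learning rate shrinks by a factor of 0.9, so training takes smaller, more careful steps as it converges.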

Model Categories

We train logistic regression (with and without RBM features as input), a feed forward network (with and without RBM features as input), and a convolutional neural network. Below is a brief reminder of the basics of each model.

For the models incorporating the RBM, we take the Fashion MNIST features/pixels and train the RBM (unsupervised learning) to extract hidden features from the visible layer, and then feed these features into either the logistic regression or the feed forward network. We then use the trained model to predict labels for the test data, evaluating how well the RBM-derived features perform in a supervised classification task.

1. Logistic Regression

Mathematically, the concept behind binary logistic regression is the logit (the natural logarithm of an odds ratio)(Peng, Lee, and Ingersoll 2002). However, since we have 10 labels, our classification task falls into “Multinomial Logistic Regression.”

\[ P(Y = k | X) = \frac{e^{\beta_{0k} + \beta_k^T X}}{\sum_{l=1}^{K} e^{\beta_{0l} + \beta_l^T X}} \tag{2}\]
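Equation 2 can be evaluated directly with NumPy. In the sketch below the coefficients are random placeholders (a fitted model would supply real values); the sizes follow our data, with K = 10 classes and 784 pixel features:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical fitted parameters (random placeholders, not a trained model)
B0 = rng.normal(size=10)                     # intercepts beta_{0k}
B = rng.normal(scale=0.01, size=(10, 784))   # coefficient vectors beta_k

x = rng.random(784)                          # one flattened 28x28 image

# Equation 2: softmax over the 10 linear scores beta_{0k} + beta_k^T x
z = B0 + B @ x
p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()  # numerically stable form

pred = int(np.argmax(p))                     # predicted label (0-9)
```

Subtracting `z.max()` before exponentiating leaves the probabilities unchanged but avoids overflow, a standard trick when computing the softmax.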

2. Simple Feed Forward Neural Network

The feed forward network (FNN) is one where information flows in one direction from input to output with no loops or feedback. There can be zero hidden layers in between (called a single-layer FNN) or one or more hidden layers (a multilayer FNN) (Sazlı 2006).

3. Convolutional Neural Network

The convolutional neural network (CNN) is a type of feed forward network, except that unlike the traditional ANN, CNNs are primarily used for pattern recognition in images (O’Shea and Nash 2015). The CNN is built from three types of layers, stacked to form the full network: convolutional, pooling, and fully-connected layers.

Below is our process for creating the RBM:

Step 1: We first initialize the RBM with random weights and biases and set visible units to 784 and hidden units to 256. We also set the number of contrastive divergence steps (k) to 1.
Step 2: Sample hidden units from visible. The math behind computing the hidden unit activations from the given input can be seen in Equation 3 (Fischer and Igel 2012) where the probability is used to sample from the Bernoulli distribution.
\[ p(H_i = 1 | \mathbf{v}) = \sigma \left( \sum_{j=1}^{m} w_{ij} v_j + c_i \right) \tag{3}\]

where \(p(\cdot)\) is the probability of the \(i\)th hidden unit being activated (= 1) given the visible input vector; \(\sigma\) is the sigmoid activation function (below), which maps the weighted sum to a probability between 0 and 1; \(m\) is the number of visible units; \(w_{ij}\) is the weight connecting visible unit \(j\) to hidden unit \(i\); \(v_j\) is the value of the \(j\)th visible unit; and \(c_i\) is the bias term for hidden unit \(i\). \[ \sigma(x) = \frac{1}{1 + e^{-x}} \]

Step 3: Sample visible units from hidden. The math behind computing the visible unit activations from the hidden layer can be seen in Equation 4 (Fischer and Igel 2012). Visible states are sampled using the Bernoulli distribution. This way we can see how well the RBM learned from the inputs.
\[ p(V_j = 1 | \mathbf{h}) = \sigma \left( \sum_{i=1}^{n} w_{ij} h_i + b_j \right) \tag{4}\]

where \(p(\cdot)\) is the probability of the \(j\)th visible unit being activated (= 1) given the hidden vector \(\mathbf{h}\); \(\sigma\) is the same as above; \(n\) is the number of hidden units; \(w_{ij}\) is the weight connecting hidden unit \(i\) to visible unit \(j\); and \(b_j\) is the bias term for the \(j\)th visible unit.
Step 4: k = 1 step of Contrastive Divergence (feed forward, feed backward), which executes Steps 2 and 3. Contrastive Divergence updates the RBM’s weights by minimizing the difference between the original input and the reconstruction produced by the RBM.
Step 5: Free energy is computed. The free energy \(F\) is given by the negative logarithm of the partition function \(Z\) (Oh, Baggag, and Nha 2020), where the partition function is
\[ Z(\theta) \equiv \sum_{v,h} e^{-E(v,h; \theta)} \tag{5}\] and the free energy function is
\[ F(\theta) = -\ln Z(\theta) \tag{6}\] where lower free energy means the RBM learned the visible state well.

Step 6: Train the RBM. Model weights are updated via gradient descent.
Step 7: Feature extraction for classification with LR. The hidden layer activations of the RBM are used as features for Logistic Regression and Feed Forward Network.
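The steps above can be condensed into a minimal NumPy sketch of one CD-1 update. The variable names and the random stand-in for an input image are our own, though the sizes (784 visible units, 256 hidden units, k = 1) follow Step 1:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Step 1: initialize with random weights and zero biases
n_visible, n_hidden, lr = 784, 256, 0.01
W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
b = np.zeros(n_visible)   # visible biases
c = np.zeros(n_hidden)    # hidden biases

# A random binary stand-in for one flattened, binarized image
v0 = (rng.random(n_visible) < 0.5).astype(float)

# Step 2: sample hidden units from visible (Equation 3)
ph0 = sigmoid(v0 @ W + c)
h0 = (rng.random(n_hidden) < ph0).astype(float)

# Step 3: sample visible units from hidden (Equation 4) -- the reconstruction
pv1 = sigmoid(h0 @ W.T + b)
v1 = (rng.random(n_visible) < pv1).astype(float)
ph1 = sigmoid(v1 @ W + c)

# Steps 4 and 6: one CD (k = 1) update from the data/reconstruction difference
W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
b += lr * (v0 - v1)
c += lr * (ph0 - ph1)

# Step 7: hidden activations serve as extracted features for a classifier
features = sigmoid(v0 @ W + c)
```

In practice these updates are averaged over mini-batches and repeated for many epochs; this sketch shows a single-example update only.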

Hyperparameter Tuning

We use the Tree-structured Parzen Estimator algorithm from Optuna (Akiba et al. 2019) to tune the hyperparameters of the RBM and the classifier models, and we use MLFlow (Zaharia et al. 2018) to record and visualize the results of the hyperparameter tuning process. The hyperparameters we tune include the learning rate, batch size, number of hidden units, and number of epochs.

Metrics Used

1. Accuracy
Accuracy is defined as the number of correct classifications divided by the total number of classifications.
\[ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \]

2. Macro F1 Score
Macro F1 score is the unweighted average of the individual F1 scores of each class. It does not account for class imbalance; however, as we will see, the classes in Fashion MNIST are all balanced. The F1 score for each individual class is as follows \[ \text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \] where precision for each class is \[ \text{Precision} = \frac{TP}{TP + FP} \] and recall for each class is \[ \text{Recall} = \frac{TP}{TP + FN} \] The definitions of these terms for multiclass problems are more complicated than in the binary case and are best displayed with examples.

Acronym definitions, using a trouser image as the example:
TP (True Positive): the image is a trouser and the model predicts trouser.
TN (True Negative): the image is not a trouser and the model predicts anything but trouser.
FP (False Positive): the image is not a trouser but the model predicts trouser.
FN (False Negative): the image is a trouser and the model predicts another class (like shirt).

As stated earlier, the individual F1 scores for each class are taken and averaged to compute the Macro F1 score in a multiclass problem like Fashion MNIST.
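A small sketch using scikit-learn (which our code already uses for this metric) confirms that the macro F1 score equals the unweighted mean of the per-class F1 scores. The toy labels below use 3 classes rather than Fashion MNIST's 10, purely for readability:

```python
from sklearn.metrics import f1_score

# Toy multiclass labels (3 classes instead of Fashion MNIST's 10)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

per_class = f1_score(y_true, y_pred, average=None)    # one F1 per class
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean

print(per_class)           # per-class F1 scores
print(round(macro_f1, 4))  # equals per_class.mean()
```

Because every Fashion MNIST class has the same number of examples, the macro and weighted averages coincide for our data, so the macro F1 is a fair summary metric here.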

Analysis and Results

Data Exploration and Visualization

We use the Fashion MNIST dataset from Zalando Research (Xiao, Rasul, and Vollgraf 2017). The set includes 70,000 grayscale images of clothing items, 60,000 for training and 10,000 for testing. Each image is 28x28 pixels (784 pixels total). Each pixel has a value associated with it ranging from 0 (white) to 255 (very dark) – whole numbers only. There are 785 columns in total as one column is dedicated to the label.


There are 10 labels in total:

0 T-shirt/top
1 Trouser
2 Pullover
3 Dress
4 Coat
5 Sandal
6 Shirt
7 Sneaker
8 Bag
9 Ankle boot

Below we load the dataset.

Code
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
import torch
import torchvision.datasets
import torchvision.models
import torchvision.transforms as transforms
import matplotlib.pyplot as plt



train_data = torchvision.datasets.FashionMNIST(
    root="./data", 
    train=True, 
    download=True, 
    transform=transforms.ToTensor()  # Converts to tensor but does NOT normalize
)

test_data = torchvision.datasets.FashionMNIST(
    root="./data", 
    train=False, 
    download=True, 
    transform=transforms.ToTensor()  
)

Get the seventh image to show a sample

Code
# Extract the first image (or choose any index)
image_tensor, label = train_data[6]  # shape: [1, 28, 28]

# Convert tensor to NumPy array
image_array = image_tensor.numpy().squeeze()  

# Plot the image
plt.figure(figsize=(5,5))
plt.imshow(image_array, cmap="gray")
plt.title(f"FashionMNIST Image (Label: {label})")
plt.axis("off")  # Hide axes
plt.show()

Code
train_images = train_data.data.numpy()  # Raw pixel values (0-255)
train_labels = train_data.targets.numpy()
X = train_images.reshape(-1, 784)  # Flatten 28x28 images into 1D (60000, 784)
Code
#print(train_images[:5])
flattened = train_images[:5].reshape(5, -1) 

# Create a DataFrame
df_flat = pd.DataFrame(flattened)
print(df_flat.head())
   0    1    2    3    4    5    6    ...  777  778  779  780  781  782  783
0    0    0    0    0    0    0    0  ...    0    0    0    0    0    0    0
1    0    0    0    0    0    1    0  ...   76    0    0    0    0    0    0
2    0    0    0    0    0    0    0  ...    0    0    0    0    0    0    0
3    0    0    0    0    0    0    0  ...    0    0    0    0    0    0    0
4    0    0    0    0    0    0    0  ...    0    0    0    0    0    0    0

[5 rows x 784 columns]
Code
#train_df.info() #datatypes are integers

There are no missing values in the data.

Code
print(np.isnan(train_images).any()) 
False

There appears to be no class imbalance:

Code
unique_labels, counts = np.unique(train_labels, return_counts=True)

# Print the counts sorted by label
for label, count in zip(unique_labels, counts):
    print(f"Label {label}: {count}")
Label 0: 6000
Label 1: 6000
Label 2: 6000
Label 3: 6000
Label 4: 6000
Label 5: 6000
Label 6: 6000
Label 7: 6000
Label 8: 6000
Label 9: 6000
Code
print(f"X shape: {X.shape}")
X shape: (60000, 784)

t-SNE Visualization
t-distributed Stochastic Neighbor Embedding (t-SNE) is used here to visualize the separation between classes in a high-dimensional dataset.
Each point represents a single fashion item (e.g., T-shirt, Trouser, etc.), and the color corresponds to its true label across the 10 categories listed above.

Code
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Run t-SNE to reduce dimensionality
#embeddings = TSNE(n_jobs=2).fit_transform(X)

tsne = TSNE(n_jobs=-1, random_state=42)  # Use -1 to use all available cores
embeddings = tsne.fit_transform(X) #use scikitlearn instead


# Create scatter plot
figure = plt.figure(figsize=(15,7))
plt.scatter(embeddings[:, 0], embeddings[:, 1], c=train_labels,
            cmap=plt.cm.get_cmap("jet", 10), marker='.')
plt.colorbar(ticks=range(10))
plt.clim(-0.5, 9.5)
plt.title("t-SNE Visualization of Fashion MNIST")
plt.show()

What the visualization shows:
Class 1 (blue / Trousers) forms a clearly distinct and tightly packed cluster, indicating that the pixel patterns for trousers are less similar to those of other classes. In contrast, Classes 4 (Coat), 6 (Shirt), and 2 (Pullover) show significant overlap, suggesting that these clothing items are harder to distinguish visually and may lead to more confusion during classification.

Modeling and Results

Our Goal
We are classifying Fashion MNIST images into one of 10 categories. To evaluate performance, we compare five different models: some trained on raw pixel values and others using features extracted by a Restricted Boltzmann Machine (RBM). Our objective is to assess whether incorporating the RBM into the workflow improves classification accuracy compared to using raw image data alone.

Our Models
1. Logistic Regression on Fashion MNIST Data
2. Feed Forward Network on Fashion MNIST Data
3. Convolutional Neural Network on Fashion MNIST Data
4. Logistic Regression on RBM Hidden Features (of Fashion MNIST Data)
5. Feed Forward Network on RBM Hidden Features (of Fashion MNIST Data)

Note: Outputs (50 trials) and code are provided below for each model, and both can be toggled by the reader.
• Clicking “Show Code and Output” reveals a toggle labeled “Code”.
• Clicking “Code” shows the output.
• Clicking again switches from the output to the actual code.
• Clicking “Show Code and Output” again collapses both views.

Import Libraries and Re-load data for first 3 models

Code
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torchvision import datasets, transforms
import numpy as np
import mlflow
import optuna
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from torch.utils.data import DataLoader

# Set device (Apple "mps" GPU backend; use "cuda" or "cpu" on other machines)
device = torch.device("mps")

# Load Fashion-MNIST dataset again for the first 3 models
transform = transforms.Compose([transforms.ToTensor()])
train_dataset = datasets.FashionMNIST(root='./data', train=True, transform=transform, download=True)
test_dataset = datasets.FashionMNIST(root='./data', train=False, transform=transform, download=True)
Code
#mlflow.end_run()
#run this in the terminal when you need to fully clean out an experiment after deleting it in the UI
#rm -rf mlruns/.trash/*

Model 1: Logistic Regression on Fashion MNIST Data

Click to Show Code and Output
Code
from sklearn.metrics import f1_score

CLASSIFIER = "LogisticRegression"  # Change for FNN, LogisticRegression, or CNN



# Define CNN model
class FashionCNN(nn.Module):
    def __init__(self, filters1, filters2, kernel1, kernel2):
        super(FashionCNN, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=filters1, kernel_size=kernel1, padding=1),
            nn.BatchNorm2d(filters1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.layer2 = nn.Sequential(
            nn.Conv2d(in_channels=filters1, out_channels=filters2, kernel_size=kernel2),
            nn.BatchNorm2d(filters2),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        self.fc1 = None #initialize first fully connected layer as none, defined later in fwd
        self.drop = nn.Dropout2d(0.25)
        self.fc2 = nn.Linear(in_features=600, out_features=120)
        self.fc3 = nn.Linear(in_features=120, out_features=10)
        

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        #Flatten tensor dynamically, preserve batch size
        out = out.view(out.size(0), -1) 
        if self.fc1 is None:
            self.fc1 = nn.Linear(out.shape[1], 600).to(x.device)
        out = self.fc1(out)
        out = self.drop(out)
        out = self.fc2(out)
        out = self.fc3(out)
        return out


# Define Optuna objective function
def objective(trial):
      # Set MLflow experiment name
    if CLASSIFIER == "LogisticRegression":
        experiment = mlflow.set_experiment("new-pytorch-fmnist-lr-noRBM")
    elif CLASSIFIER == "FNN":
        experiment = mlflow.set_experiment("new-pytorch-fmnist-fnn-noRBM")
    elif CLASSIFIER == "CNN":
        experiment = mlflow.set_experiment("new-pytorch-fmnist-cnn-noRBM")
    batch_size = trial.suggest_int("batch_size", 64, 256, step=32)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

    mlflow.start_run(experiment_id=experiment.experiment_id)
    num_classifier_epochs = trial.suggest_int("num_classifier_epochs", 5, 5) 
    mlflow.log_param("num_classifier_epochs", num_classifier_epochs)

    if CLASSIFIER == "FNN":
        hidden_size = trial.suggest_int("fnn_hidden", 192, 384)
        learning_rate = trial.suggest_float("learning_rate", 0.0001, 0.0025)

        mlflow.log_param("classifier", "FNN")
        mlflow.log_param("fnn_hidden", hidden_size)
        mlflow.log_param("learning_rate", learning_rate)

        model = nn.Sequential(
            nn.Linear(784, hidden_size), 
            nn.ReLU(),
            nn.Linear(hidden_size, 10)
        ).to(device)

        optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    elif CLASSIFIER == "CNN":
        filters1 = trial.suggest_int("filters1", 16, 64, step=16)
        filters2 = trial.suggest_int("filters2", 32, 128, step=32)
        kernel1 = trial.suggest_int("kernel1", 3, 5)
        kernel2 = trial.suggest_int("kernel2", 3, 5)
        learning_rate = trial.suggest_float("learning_rate", 0.0001, 0.0025)

        mlflow.log_param("classifier", "CNN")
        mlflow.log_param("filters1", filters1)
        mlflow.log_param("filters2", filters2)
        mlflow.log_param("kernel1", kernel1)
        mlflow.log_param("kernel2", kernel2)
        mlflow.log_param("learning_rate", learning_rate)

        model = FashionCNN(filters1, filters2, kernel1, kernel2).to(device)
        optimizer = optim.Adam(model.parameters(), lr=learning_rate)

      
    elif CLASSIFIER == "LogisticRegression":
        mlflow.log_param("classifier", "LogisticRegression")
    
        # Prepare data for Logistic Regression (Flatten 28x28 images to 784 features)
        train_features = train_dataset.data.view(-1, 784).numpy()
        train_labels = train_dataset.targets.numpy()
        test_features = test_dataset.data.view(-1, 784).numpy()
        test_labels = test_dataset.targets.numpy()
    
        # Normalize the pixel values to [0,1] for better convergence
        train_features = train_features / 255.0
        test_features = test_features / 255.0
    
    
        C = trial.suggest_float("C", 0.01, 10.0, log=True)  
        solver = "saga" 
    
        model = LogisticRegression(C=C, max_iter=num_classifier_epochs, solver=solver)
        model.fit(train_features, train_labels)
    
    
        predictions = model.predict(test_features)
        accuracy = accuracy_score(test_labels, predictions) * 100
        
        macro_f1 = f1_score(test_labels, predictions, average="macro") #for f1
        print(f"Logistic Regression Test Accuracy: {accuracy:.2f}%")
        print(f"Macro F1 Score: {macro_f1:.4f}") #for f1
    
        mlflow.log_param("C", C)
        mlflow.log_metric("test_accuracy", accuracy)
        mlflow.log_metric("macro_f1", macro_f1) #for f1
        mlflow.end_run()
        return accuracy

    # Training Loop for FNN and CNN
    criterion = nn.CrossEntropyLoss()

    model.train()
    for epoch in range(num_classifier_epochs):
        running_loss = 0.0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images) if CLASSIFIER == "CNN" else model(images.view(images.size(0), -1))

            optimizer.zero_grad()
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

        print(f"{CLASSIFIER} Epoch {epoch+1}: loss = {running_loss / len(train_loader):.4f}")

    # Model Evaluation
    model.eval()
    correct, total = 0, 0
    all_preds = []   # for f1
    all_labels = [] 
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images) if CLASSIFIER == "CNN" else model(images.view(images.size(0), -1))
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            all_preds.extend(predicted.cpu().numpy())   #for f1
            all_labels.extend(labels.cpu().numpy()) #for f1

    accuracy = 100 * correct / total
    macro_f1 = f1_score(all_labels, all_preds, average="macro") #for f1
    print(f"Test Accuracy: {accuracy:.2f}%")
    print(f"Macro F1 Score: {macro_f1:.4f}") #for f1

    mlflow.log_metric("test_accuracy", accuracy)
    mlflow.log_metric("macro_f1", macro_f1) #for f1
    mlflow.end_run()
    return accuracy

if __name__ == "__main__":
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=1) # n_trials set to 1 for quick rendering
    print(f"Best Parameters for {CLASSIFIER}:", study.best_params)
    print("Best Accuracy:", study.best_value)
Logistic Regression Test Accuracy: 83.96%
Macro F1 Score: 0.8384
Best Parameters for LogisticRegression: {'batch_size': 96, 'num_classifier_epochs': 5, 'C': 0.01287119915703718}
Best Accuracy: 83.96000000000001

[I 2025-04-04 14:06:30,938] A new study created in memory with name: no-name-fb21a50e-326e-4252-b5ee-ebf01ca10b43
[I 2025-04-04 14:06:37,594] Trial 0 finished with value: 83.96000000000001 and parameters: {'batch_size': 96, 'num_classifier_epochs': 5, 'C': 0.01287119915703718}. Best is trial 0 with value: 83.96000000000001.

Test Accuracy of Logistic Regression by C (inverse regularization strength)

\[ C = \frac{1}{\lambda} \quad \text{(inverse regularization strength)} \]

Lower values of C mean more regularization (higher penalties for larger weight coefficients)

What the plot shows:
Most Optuna trials used lower values of C, so the optimization favors stronger regularization. This is further evidenced by the clustering of higher accuracies at lower values of C. A possible anomaly is seen at C = 10 with fairly high accuracy; however, it is still not higher than the accuracies at lower values of C.

Model 2: Feed Forward Network on Fashion MNIST Data

Click to Show Code and Output
Code
from sklearn.metrics import f1_score

CLASSIFIER = "FNN"  # Change for FNN, LogisticRegression, or CNN

# Define CNN model
class FashionCNN(nn.Module):
    def __init__(self, filters1, filters2, kernel1, kernel2):
        super(FashionCNN, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=filters1, kernel_size=kernel1, padding=1),
            nn.BatchNorm2d(filters1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.layer2 = nn.Sequential(
            nn.Conv2d(in_channels=filters1, out_channels=filters2, kernel_size=kernel2),
            nn.BatchNorm2d(filters2),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        self.fc1 = None #initialize first fully connected layer as none, defined later in fwd
        self.drop = nn.Dropout2d(0.25)
        self.fc2 = nn.Linear(in_features=600, out_features=120)
        self.fc3 = nn.Linear(in_features=120, out_features=10)
        

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        #Flatten tensor dynamically
        out = out.view(out.size(0), -1)
        if self.fc1 is None:
            self.fc1 = nn.Linear(out.shape[1], 600).to(x.device)
        out = self.fc1(out)
        out = self.drop(out)
        out = self.fc2(out)
        out = self.fc3(out)
        return out



# Define Optuna objective function
def objective(trial):
      # Set MLflow experiment name
    if CLASSIFIER == "LogisticRegression":
        experiment = mlflow.set_experiment("new-pytorch-fmnist-lr-noRBM")
    elif CLASSIFIER == "FNN":
        experiment = mlflow.set_experiment("new-pytorch-fmnist-fnn-noRBM")
    elif CLASSIFIER == "CNN":
        experiment = mlflow.set_experiment("new-pytorch-fmnist-cnn-noRBM")
    batch_size = trial.suggest_int("batch_size", 64, 256, step=32)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

    mlflow.start_run(experiment_id=experiment.experiment_id)
    num_classifier_epochs = trial.suggest_int("num_classifier_epochs", 5, 5) 
    mlflow.log_param("num_classifier_epochs", num_classifier_epochs)

    if CLASSIFIER == "FNN":
        hidden_size = trial.suggest_int("fnn_hidden", 192, 384)
        learning_rate = trial.suggest_float("learning_rate", 0.0001, 0.0025)

        mlflow.log_param("classifier", "FNN")
        mlflow.log_param("fnn_hidden", hidden_size)
        mlflow.log_param("learning_rate", learning_rate)

        model = nn.Sequential(
            nn.Linear(784, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 10)
        ).to(device)

        optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    elif CLASSIFIER == "CNN":
        filters1 = trial.suggest_int("filters1", 16, 64, step=16)
        filters2 = trial.suggest_int("filters2", 32, 128, step=32)
        kernel1 = trial.suggest_int("kernel1", 3, 5)
        kernel2 = trial.suggest_int("kernel2", 3, 5)
        learning_rate = trial.suggest_float("learning_rate", 0.0001, 0.0025)

        mlflow.log_param("classifier", "CNN")
        mlflow.log_param("filters1", filters1)
        mlflow.log_param("filters2", filters2)
        mlflow.log_param("kernel1", kernel1)
        mlflow.log_param("kernel2", kernel2)
        mlflow.log_param("learning_rate", learning_rate)

        model = FashionCNN(filters1, filters2, kernel1, kernel2).to(device)
        optimizer = optim.Adam(model.parameters(), lr=learning_rate)

      
    elif CLASSIFIER == "LogisticRegression":
        mlflow.log_param("classifier", "LogisticRegression")
    
        # Prepare data for Logistic Regression (Flatten 28x28 images to 784 features)
        train_features = train_dataset.data.view(-1, 784).numpy()
        train_labels = train_dataset.targets.numpy()
        test_features = test_dataset.data.view(-1, 784).numpy()
        test_labels = test_dataset.targets.numpy()
    
        # Normalize the pixel values to [0,1] for better convergence
        train_features = train_features / 255.0
        test_features = test_features / 255.0
    
    
        C = trial.suggest_float("C", 0.01, 10.0, log=True)  
        solver = "saga" 
    
        model = LogisticRegression(C=C, max_iter=num_classifier_epochs, solver=solver)
        model.fit(train_features, train_labels)
    
    
        predictions = model.predict(test_features)
        accuracy = accuracy_score(test_labels, predictions) * 100
        
        macro_f1 = f1_score(test_labels, predictions, average="macro") #for f1
        print(f"Logistic Regression Test Accuracy: {accuracy:.2f}%")
        print(f"Macro F1 Score: {macro_f1:.4f}") #for f1
    
        mlflow.log_param("C", C)
        mlflow.log_metric("test_accuracy", accuracy)
        mlflow.log_metric("macro_f1", macro_f1) #for f1
        mlflow.end_run()
        return accuracy

    # Training Loop for FNN and CNN
    criterion = nn.CrossEntropyLoss()

    model.train()
    for epoch in range(num_classifier_epochs):
        running_loss = 0.0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images) if CLASSIFIER == "CNN" else model(images.view(images.size(0), -1))

            optimizer.zero_grad()
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

        print(f"{CLASSIFIER} Epoch {epoch+1}: loss = {running_loss / len(train_loader):.4f}")

    # Model Evaluation
    model.eval()
    correct, total = 0, 0
    all_preds = []   # for f1
    all_labels = [] 
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images) if CLASSIFIER == "CNN" else model(images.view(images.size(0), -1))
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            all_preds.extend(predicted.cpu().numpy())   #for f1
            all_labels.extend(labels.cpu().numpy()) #for f1

    accuracy = 100 * correct / total
    macro_f1 = f1_score(all_labels, all_preds, average="macro") #for f1
    print(f"Test Accuracy: {accuracy:.2f}%")
    print(f"Macro F1 Score: {macro_f1:.4f}") #for f1

    mlflow.log_metric("test_accuracy", accuracy)
    mlflow.log_metric("macro_f1", macro_f1) #for f1
    mlflow.end_run()
    return accuracy

if __name__ == "__main__":
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=1) # n_trials set to 1 for quick rendering
    print(f"Best Parameters for {CLASSIFIER}:", study.best_params)
    print("Best Accuracy:", study.best_value)
FNN Epoch 1: loss = 0.7546
FNN Epoch 2: loss = 0.4809
FNN Epoch 3: loss = 0.4320
FNN Epoch 4: loss = 0.4056
FNN Epoch 5: loss = 0.3834
Test Accuracy: 85.45%
Macro F1 Score: 0.8551
Best Parameters for FNN: {'batch_size': 160, 'num_classifier_epochs': 5, 'fnn_hidden': 266, 'learning_rate': 0.00028738067350860306}
Best Accuracy: 85.45

[I 2025-04-04 14:06:38,127] A new study created in memory with name: no-name-027a4714-06fd-48ad-81c3-4730a2618c0e
[I 2025-04-04 14:06:51,721] Trial 0 finished with value: 85.45 and parameters: {'batch_size': 160, 'num_classifier_epochs': 5, 'fnn_hidden': 266, 'learning_rate': 0.00028738067350860306}. Best is trial 0 with value: 85.45.

Test Accuracy by FNN Hidden Units

What the plot shows:
Optuna sampled larger hidden-layer sizes in the feedforward network more frequently, suggesting a preference for more complex models. However, test accuracy appears to level off between 300 and 375 hidden units, indicating the useful range of model complexity had been reached; further increases in hidden units would likely not yield higher accuracy.

Model 3: Convolutional Neural Network on Fashion MNIST Data
Base code for CNN structure borrowed from Kaggle

Click to Show Code and Output
Code
from sklearn.metrics import f1_score

CLASSIFIER = "CNN"  # Change for FNN, LogisticRegression, or CNN

# Define CNN model
class FashionCNN(nn.Module):
    def __init__(self, filters1, filters2, kernel1, kernel2):
        super(FashionCNN, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=filters1, kernel_size=kernel1, padding=1),
            nn.BatchNorm2d(filters1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.layer2 = nn.Sequential(
            nn.Conv2d(in_channels=filters1, out_channels=filters2, kernel_size=kernel2),
            nn.BatchNorm2d(filters2),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        # Size fc1 from a dummy forward pass (assumes 28x28 grayscale input, as in
        # Fashion MNIST) so that fc1's weights exist before the optimizer is built;
        # a layer created lazily inside forward() would be missing from
        # model.parameters() and would never be updated by the optimizer
        with torch.no_grad():
            dummy = self.layer2(self.layer1(torch.zeros(1, 1, 28, 28)))
        self.fc1 = nn.Linear(in_features=dummy.view(1, -1).size(1), out_features=600)
        self.drop = nn.Dropout(0.25)  # plain Dropout: input here is flattened, not 4D
        self.fc2 = nn.Linear(in_features=600, out_features=120)
        self.fc3 = nn.Linear(in_features=120, out_features=10)


    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        # Flatten the feature maps before the fully connected layers
        out = out.view(out.size(0), -1)
        out = self.fc1(out)
        out = self.drop(out)
        out = self.fc2(out)
        out = self.fc3(out)
        return out



# Define Optuna objective function
def objective(trial):
        # Set MLflow experiment name
    if CLASSIFIER == "LogisticRegression":
        experiment = mlflow.set_experiment("new-pytorch-fmnist-lr-noRBM")
    elif CLASSIFIER == "FNN":
        experiment = mlflow.set_experiment("new-pytorch-fmnist-fnn-noRBM")
    elif CLASSIFIER == "CNN":
        experiment = mlflow.set_experiment("new-pytorch-fmnist-cnn-noRBM")
    batch_size = trial.suggest_int("batch_size", 64, 256, step=32)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

    mlflow.start_run(experiment_id=experiment.experiment_id)
    num_classifier_epochs = trial.suggest_int("num_classifier_epochs", 5, 5) 
    mlflow.log_param("num_classifier_epochs", num_classifier_epochs)

    if CLASSIFIER == "FNN":
        hidden_size = trial.suggest_int("fnn_hidden", 192, 384)
        learning_rate = trial.suggest_float("learning_rate", 0.0001, 0.0025)

        mlflow.log_param("classifier", "FNN")
        mlflow.log_param("fnn_hidden", hidden_size)
        mlflow.log_param("learning_rate", learning_rate)

        model = nn.Sequential(
            nn.Linear(784, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 10)
        ).to(device)

        optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    elif CLASSIFIER == "CNN":
        filters1 = trial.suggest_int("filters1", 16, 64, step=16)
        filters2 = trial.suggest_int("filters2", 32, 128, step=32)
        kernel1 = trial.suggest_int("kernel1", 3, 5)
        kernel2 = trial.suggest_int("kernel2", 3, 5)
        learning_rate = trial.suggest_float("learning_rate", 0.0001, 0.0025)

        mlflow.log_param("classifier", "CNN")
        mlflow.log_param("filters1", filters1)
        mlflow.log_param("filters2", filters2)
        mlflow.log_param("kernel1", kernel1)
        mlflow.log_param("kernel2", kernel2)
        mlflow.log_param("learning_rate", learning_rate)

        model = FashionCNN(filters1, filters2, kernel1, kernel2).to(device)
        optimizer = optim.Adam(model.parameters(), lr=learning_rate)

      
    elif CLASSIFIER == "LogisticRegression":
        mlflow.log_param("classifier", "LogisticRegression")
    
        # Prepare data for Logistic Regression (Flatten 28x28 images to 784 features)
        train_features = train_dataset.data.view(-1, 784).numpy()
        train_labels = train_dataset.targets.numpy()
        test_features = test_dataset.data.view(-1, 784).numpy()
        test_labels = test_dataset.targets.numpy()
    
        # Normalize the pixel values to [0,1] for better convergence
        train_features = train_features / 255.0
        test_features = test_features / 255.0
    
    
        C = trial.suggest_float("C", 0.01, 10.0, log=True)  
        solver = "saga" 
    
        model = LogisticRegression(C=C, max_iter=num_classifier_epochs, solver=solver)
        model.fit(train_features, train_labels)
    
    
        predictions = model.predict(test_features)
        accuracy = accuracy_score(test_labels, predictions) * 100
        
        macro_f1 = f1_score(test_labels, predictions, average="macro") #for f1
        print(f"Logistic Regression Test Accuracy: {accuracy:.2f}%")
        print(f"Macro F1 Score: {macro_f1:.4f}") #for f1
    
        mlflow.log_param("C", C)
        mlflow.log_metric("test_accuracy", accuracy)
        mlflow.log_metric("macro_f1", macro_f1) #for f1
        mlflow.end_run()
        return accuracy

    # Training Loop for FNN and CNN
    criterion = nn.CrossEntropyLoss()

    model.train()
    for epoch in range(num_classifier_epochs):
        running_loss = 0.0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images) if CLASSIFIER == "CNN" else model(images.view(images.size(0), -1))

            optimizer.zero_grad()
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

        print(f"{CLASSIFIER} Epoch {epoch+1}: loss = {running_loss / len(train_loader):.4f}")

    # Model Evaluation
    model.eval()
    correct, total = 0, 0
    all_preds = []   # for f1
    all_labels = [] 
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images) if CLASSIFIER == "CNN" else model(images.view(images.size(0), -1))
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            all_preds.extend(predicted.cpu().numpy())   #for f1
            all_labels.extend(labels.cpu().numpy()) #for f1

    accuracy = 100 * correct / total
    macro_f1 = f1_score(all_labels, all_preds, average="macro") #for f1
    print(f"Test Accuracy: {accuracy:.2f}%")
    print(f"Macro F1 Score: {macro_f1:.4f}") #for f1

    mlflow.log_metric("test_accuracy", accuracy)
    mlflow.log_metric("macro_f1", macro_f1) #for f1
    mlflow.end_run()
    return accuracy

if __name__ == "__main__":
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=1) # n_trials set to 1 for quick rendering
    print(f"Best Parameters for {CLASSIFIER}:", study.best_params)
    print("Best Accuracy:", study.best_value)
CNN Epoch 1: loss = 0.4713
CNN Epoch 2: loss = 0.3090
CNN Epoch 3: loss = 0.2772
CNN Epoch 4: loss = 0.2593
CNN Epoch 5: loss = 0.2421
Test Accuracy: 90.36%
Macro F1 Score: 0.9047
Best Parameters for CNN: {'batch_size': 224, 'num_classifier_epochs': 5, 'filters1': 64, 'filters2': 96, 'kernel1': 3, 'kernel2': 3, 'learning_rate': 0.0008774111162324804}
Best Accuracy: 90.36

[I 2025-04-04 14:06:52,110] A new study created in memory with name: no-name-a3574784-ab9e-44cb-b63e-47c63859eb38
[I 2025-04-04 14:07:29,015] Trial 0 finished with value: 90.36 and parameters: {'batch_size': 224, 'num_classifier_epochs': 5, 'filters1': 64, 'filters2': 96, 'kernel1': 3, 'kernel2': 3, 'learning_rate': 0.0008774111162324804}. Best is trial 0 with value: 90.36.
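Because each Optuna trial samples its own kernel sizes, the spatial size of the final feature maps, and hence the flattened width feeding fc1, changes from trial to trial. That width can be computed directly; the helper below is an illustrative sketch assuming the FashionCNN structure above (a conv with padding 1 followed by 2x2 max-pooling, then an unpadded conv and another 2x2 max-pooling) on 28x28 inputs:

```python
def flatten_dim(filters2, kernel1, kernel2, size=28, padding1=1):
    # layer1: conv (stride 1) -> size + 2*padding1 - kernel1 + 1, then 2x2 max-pool
    size = (size + 2 * padding1 - kernel1 + 1) // 2
    # layer2: conv (no padding) -> size - kernel2 + 1, then 2x2 max-pool
    size = (size - kernel2 + 1) // 2
    # flattened width = channels * height * width
    return filters2 * size * size

# The best CNN trial above (filters2=96, kernel1=3, kernel2=3)
print(flatten_dim(96, 3, 3))  # -> 3456
```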

Test Accuracy Based on the Number of Filters in the First Conv2D Layer

What the plot shows:
Although the highest test accuracy was achieved with 64 filters in the first Conv2D layer, the number of filters alone isn’t a strong predictor of model performance. Each filter count shows high variance (accuracies are spread out vertically for each value). This, combined with the fact that accuracies are well distributed across the different filter counts, suggests other hyperparameters may play a bigger role in determining accuracy.

Test Accuracy Based on the Number of Filters in the Second Conv2D Layer

What the plot shows:
As with the first Conv2D layer, the number of filters does not appear to be a strong predictor of accuracy. Optuna did sample higher filter counts more frequently (up to 128 for this second layer), suggesting larger counts performed somewhat better, but there is still high variance in accuracy at each filter count.

Test Accuracy Based on Kernel Size in the First Conv2D Layer

What the plot shows:
A kernel size of 3 was sampled more frequently by Optuna and yielded higher accuracies than kernel sizes of 4 or 5.
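One plausible contributing factor is cost: on 28x28 images, a 3x3 kernel preserves more spatial resolution and uses fewer weights per filter than a 5x5 kernel. A rough parameter count for the first conv layer (one input channel; the numbers are illustrative only):

```python
def conv2d_params(in_channels, out_channels, kernel):
    # each output filter has in_channels * kernel * kernel weights plus one bias
    return out_channels * (in_channels * kernel * kernel + 1)

print(conv2d_params(1, 64, 3))  # 3x3 kernels: 640 parameters
print(conv2d_params(1, 64, 5))  # 5x5 kernels: 1664 parameters
```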

Test Accuracy Based on Kernel Size in the Second Conv2D Layer

What the plot shows:
As with the first Conv2D layer, a kernel size of 3 was strongly favored by Optuna and consistently led to higher test accuracies.

Model 4: Logistic Regression on RBM Hidden Features (of Fashion MNIST Data)

Click to Show Code and Output
Code
from sklearn.metrics import accuracy_score, f1_score
CLASSIFIER = 'LogisticRegression'

if CLASSIFIER == 'LogisticRegression':
    experiment = mlflow.set_experiment("new-pytorch-fmnist-lr-withrbm")
else:
    experiment = mlflow.set_experiment("new-pytorch-fmnist-fnn-withrbm")


class RBM(nn.Module):
    def __init__(self, n_visible=784, n_hidden=256, k=1):
        super(RBM, self).__init__()
        self.n_visible = n_visible
        self.n_hidden = n_hidden
        # Initialize weights and biases
        self.W = nn.Parameter(torch.randn(n_hidden, n_visible) * 0.1)
        self.v_bias = nn.Parameter(torch.zeros(n_visible))
        self.h_bias = nn.Parameter(torch.zeros(n_hidden))
        self.k = k  # CD-k steps

    def sample_h(self, v):
        # Given visible v, sample hidden h
        p_h = torch.sigmoid(F.linear(v, self.W, self.h_bias))  # p(h=1|v)
        h_sample = torch.bernoulli(p_h)                        # sample Bernoulli
        return p_h, h_sample

    def sample_v(self, h):
        # Given hidden h, sample visible v
        p_v = torch.sigmoid(F.linear(h, self.W.t(), self.v_bias))  # p(v=1|h)
        v_sample = torch.bernoulli(p_v)
        return p_v, v_sample

    def forward(self, v):
        # Perform k steps of contrastive divergence starting from v
        v_k = v.clone()
        for _ in range(self.k):
            _, h_k = self.sample_h(v_k)    # sample hidden from current visible
            _, v_k = self.sample_v(h_k)    # sample visible from hidden
        return v_k  # k-step reconstructed visible

    def free_energy(self, v):
        # Compute the visible bias term for each sample in the batch
        vbias_term = (v * self.v_bias).sum(dim=1)  # shape: [batch_size]
        # Compute the activation of the hidden units
        wx_b = F.linear(v, self.W, self.h_bias)     # shape: [batch_size, n_hidden]
        # Compute the hidden term
        hidden_term = torch.sum(torch.log1p(torch.exp(wx_b)), dim=1)  # shape: [batch_size]
        # Return the mean free energy over the batch
        return - (vbias_term + hidden_term).mean()
    
transform = transforms.Compose([transforms.ToTensor()])
train_dataset = datasets.FashionMNIST(root='./data', train=True, transform=transform, download=True)
test_dataset = datasets.FashionMNIST(root='./data', train=False, transform=transform, download=True)

def objective(trial):
    num_rbm_epochs = trial.suggest_int("num_rbm_epochs", 5, 5)# 24, 33)
    batch_size = trial.suggest_int("batch_size", 192, 1024)
    rbm_lr = trial.suggest_float("rbm_lr", 0.05, 0.1)
    rbm_hidden = trial.suggest_int("rbm_hidden", 384, 8192)

    mlflow.start_run(experiment_id=experiment.experiment_id)
    if CLASSIFIER != 'LogisticRegression':
        fnn_hidden = trial.suggest_int("fnn_hidden", 192, 384)
        fnn_lr = trial.suggest_float("fnn_lr", 0.0001, 0.0025)
        mlflow.log_param("fnn_hidden", fnn_hidden)
        mlflow.log_param("fnn_lr", fnn_lr)

    num_classifier_epochs = trial.suggest_int("num_classifier_epochs", 5, 5)# 40, 60)

    mlflow.log_param("num_rbm_epochs", num_rbm_epochs)
    mlflow.log_param("batch_size", batch_size)
    mlflow.log_param("rbm_lr", rbm_lr)
    mlflow.log_param("rbm_hidden", rbm_hidden)
    mlflow.log_param("num_classifier_epochs", num_classifier_epochs)

    # Instantiate RBM and optimizer
    device = torch.device("mps")
    rbm = RBM(n_visible=784, n_hidden=rbm_hidden, k=1).to(device)
    optimizer = torch.optim.SGD(rbm.parameters(), lr=rbm_lr)

    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

    rbm_training_failed = False
    # Training loop (assuming train_loader yields batches of images and labels)
    for epoch in range(num_rbm_epochs):
        total_loss = 0.0
        for images, _ in train_loader:
            # Flatten images and binarize
            v0 = images.view(-1, 784).to(rbm.W.device)      # shape [batch_size, 784]
            v0 = torch.bernoulli(v0)                        # sample binary input
            vk = rbm(v0)                                    # k-step CD reconstruction
            # Compute contrastive divergence loss (free energy difference)
            loss = rbm.free_energy(v0) - rbm.free_energy(vk)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"Epoch {epoch+1}: avg free-energy loss = {total_loss/len(train_loader):.4f}")
        if np.isnan(total_loss):
            rbm_training_failed = True
            break

    # After the epoch loop: bail out if RBM training diverged (NaN loss)
    if rbm_training_failed:
        accuracy = 0.0
        macro_f1 = 0.0
        print("RBM training failed - returning 0.0 for accuracy and macro F1")
        mlflow.log_metric("test_accuracy", accuracy)
        mlflow.log_metric("macro_f1", macro_f1)
        mlflow.set_tag("status", "rbm_failed")  # Optional tag
        mlflow.end_run()
        return float(accuracy)
    else:
        rbm.eval()  # set in evaluation mode if using any layers that behave differently in training
        features_list = []
        labels_list = []
        for images, labels in train_loader:
            # Use the normalized pixel values directly (could optionally binarize here)
            v = images.view(-1, 784).to(rbm.W.device)
            h_prob, h_sample = rbm.sample_h(v)  # get hidden activations
            features_list.append(h_prob.cpu().detach().numpy())
            labels_list.append(labels.numpy())
        train_features = np.concatenate(features_list)  # shape: [N_train, n_hidden]
        train_labels = np.concatenate(labels_list)

        # Convert pre-extracted training features and labels to tensors and create a DataLoader
        train_features_tensor = torch.tensor(train_features, dtype=torch.float32)
        train_labels_tensor = torch.tensor(train_labels, dtype=torch.long)
        train_feature_dataset = torch.utils.data.TensorDataset(train_features_tensor, train_labels_tensor)
        train_feature_loader = torch.utils.data.DataLoader(train_feature_dataset, batch_size=batch_size, shuffle=True)

            
        if CLASSIFIER == 'LogisticRegression':
            # add optuna tuning same as log reg without RBM features...
            lr_C = trial.suggest_float("lr_C", 0.01, 10.0, log=True)  
            mlflow.log_param("lr_C", lr_C)  # Log the chosen C value

            classifier = LogisticRegression(max_iter=num_classifier_epochs, C=lr_C, solver="saga") 
            classifier.fit(train_features, train_labels)            
            
        else:
            classifier = nn.Sequential(
                nn.Linear(rbm.n_hidden, fnn_hidden),
                nn.ReLU(),
                nn.Linear(fnn_hidden, 10)
            )

            # Move classifier to the same device as the RBM
            classifier = classifier.to(device)
            criterion = nn.CrossEntropyLoss()
            classifier_optimizer = torch.optim.Adam(classifier.parameters(), lr=fnn_lr)

            classifier.train()
            for epoch in range(num_classifier_epochs):
                running_loss = 0.0
                for features, labels in train_feature_loader:
                    features = features.to(device)
                    labels = labels.to(device)
                    
                    # Forward pass through classifier
                    outputs = classifier(features)
                    loss = criterion(outputs, labels)
                    
                    # Backpropagation and optimization
                    classifier_optimizer.zero_grad()
                    loss.backward()
                    classifier_optimizer.step()
                    
                    running_loss += loss.item()
                avg_loss = running_loss / len(train_feature_loader)
                print(f"Classifier Epoch {epoch+1}: loss = {avg_loss:.4f}")

        # Evaluate the classifier on test data.
        # Here we extract features from the RBM for each test image.
        if CLASSIFIER != 'LogisticRegression':
            classifier.eval()
            correct = 0
            total = 0
        features_list = []
        labels_list = []
        with torch.no_grad():
            for images, labels in test_loader:
                v = images.view(-1, 784).to(device)
                # Extract hidden activations; you can use either h_prob or h_sample.
                h_prob, _ = rbm.sample_h(v)
                if CLASSIFIER == 'LogisticRegression':
                    features_list.append(h_prob.cpu().detach().numpy())
                    labels_list.append(labels.numpy())
                else:
                    outputs = classifier(h_prob)
                    _, predicted = torch.max(outputs.data, 1)
                    total += labels.size(0)
                    correct += (predicted.cpu() == labels).sum().item()

        if CLASSIFIER == 'LogisticRegression':
            test_features = np.concatenate(features_list)
            test_labels = np.concatenate(labels_list)
            predictions = classifier.predict(test_features)
            accuracy = accuracy_score(test_labels, predictions) * 100
        
            macro_f1 = f1_score(test_labels, predictions, average="macro") 
        
        else:
            accuracy = 100 * correct / total
        
            all_preds = [] 
            all_labels = [] 
            classifier.eval()
            with torch.no_grad():
                for images, labels in test_loader:
                    v = images.view(-1, 784).to(device)
                    h_prob, _ = rbm.sample_h(v)
                    outputs = classifier(h_prob)
                    _, predicted = torch.max(outputs.data, 1)
                    all_preds.extend(predicted.cpu().numpy()) 
                    all_labels.extend(labels.numpy()) 
        
            macro_f1 = f1_score(all_labels, all_preds, average="macro") 
        
        print(f"Test Accuracy: {accuracy:.2f}%")
        print(f"Macro F1 Score: {macro_f1:.4f}") 
        
        mlflow.log_metric("test_accuracy", accuracy)
        mlflow.log_metric("macro_f1", macro_f1) 
        mlflow.end_run()
        return float(accuracy if accuracy is not None else 0.0)

if __name__ == "__main__":
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=1) # n_trials set to 1 for quick rendering
    print(study.best_params)
    print(study.best_value)
    print(study.best_trial)
Epoch 1: avg free-energy loss = 41.8018
Epoch 2: avg free-energy loss = 7.7330
Epoch 3: avg free-energy loss = 4.1437
Epoch 4: avg free-energy loss = 2.3385
Epoch 5: avg free-energy loss = 1.2435
Test Accuracy: 85.87%
Macro F1 Score: 0.8578
{'num_rbm_epochs': 5, 'batch_size': 945, 'rbm_lr': 0.0996530039470581, 'rbm_hidden': 1663, 'num_classifier_epochs': 5, 'lr_C': 5.040908683259759}
85.87
FrozenTrial(number=0, state=1, values=[85.87], datetime_start=datetime.datetime(2025, 4, 4, 14, 7, 29, 528873), datetime_complete=datetime.datetime(2025, 4, 4, 14, 7, 58, 976742), params={'num_rbm_epochs': 5, 'batch_size': 945, 'rbm_lr': 0.0996530039470581, 'rbm_hidden': 1663, 'num_classifier_epochs': 5, 'lr_C': 5.040908683259759}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'num_rbm_epochs': IntDistribution(high=5, log=False, low=5, step=1), 'batch_size': IntDistribution(high=1024, log=False, low=192, step=1), 'rbm_lr': FloatDistribution(high=0.1, log=False, low=0.05, step=None), 'rbm_hidden': IntDistribution(high=8192, log=False, low=384, step=1), 'num_classifier_epochs': IntDistribution(high=5, log=False, low=5, step=1), 'lr_C': FloatDistribution(high=10.0, log=True, low=0.01, step=None)}, trial_id=0, value=None)

[I 2025-04-04 14:07:29,528] A new study created in memory with name: no-name-3a02a0b6-ea4f-41c9-bd1e-db2585e63bcd
[I 2025-04-04 14:07:58,976] Trial 0 finished with value: 85.87 and parameters: {'num_rbm_epochs': 5, 'batch_size': 945, 'rbm_lr': 0.0996530039470581, 'rbm_hidden': 1663, 'num_classifier_epochs': 5, 'lr_C': 5.040908683259759}. Best is trial 0 with value: 85.87.
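For reference, the free_energy method in the RBM class implements the standard free energy of a binary RBM: with visible vector $\mathbf{v}$, visible bias $\mathbf{b}$, hidden bias $\mathbf{c}$, and weight matrix $W$ (row $W_j$ feeding hidden unit $j$),

$$
F(\mathbf{v}) = -\mathbf{b}^\top \mathbf{v} \;-\; \sum_{j=1}^{n_h} \log\!\left(1 + e^{\,c_j + W_j \mathbf{v}}\right),
$$

so the training loss $F(v_0) - F(v_k)$ lowers the free energy of the data relative to the model's $k$-step reconstructions, approximating the contrastive divergence update.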

Test Accuracy of Logistic Regression on RBM Hidden Features by Inverse Regularization Strength

What the plot shows:
When using RBM-extracted hidden features as input to logistic regression, the inverse regularization strength does not appear to be a strong predictor of test accuracy.
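This is not surprising once we recall that scikit-learn's C is the inverse of the L2 penalty strength: smaller C shrinks the coefficients harder. A quick illustration of what C controls, on synthetic data rather than the Fashion MNIST features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Smaller C = stronger L2 regularization = smaller coefficients
strong_reg = LogisticRegression(C=0.01).fit(X, y)
weak_reg = LogisticRegression(C=10.0).fit(X, y)
print(np.linalg.norm(strong_reg.coef_), np.linalg.norm(weak_reg.coef_))
```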

Test Accuracy By Number of RBM Hidden Units

What the plot shows:
Optuna slightly favors larger numbers of hidden units in the RBM, with a peak at 5,340 (and similar peaks at 5,358, 5,341, etc.). Beyond roughly 7,000 units, however, accuracy appears to decline, suggesting the optimal number of units lies around the 5,300 mark.

Model 5: Feed Forward Network on RBM Hidden Features (of Fashion MNIST Data)

Click to Show Code and Output
Code
from sklearn.metrics import accuracy_score, f1_score
CLASSIFIER = 'FNN'

if CLASSIFIER == 'LogisticRegression':
    experiment = mlflow.set_experiment("new-pytorch-fmnist-lr-withrbm")
else:
    experiment = mlflow.set_experiment("new-pytorch-fmnist-fnn-withrbm")


class RBM(nn.Module):
    def __init__(self, n_visible=784, n_hidden=256, k=1):
        super(RBM, self).__init__()
        self.n_visible = n_visible
        self.n_hidden = n_hidden
        # Initialize weights and biases
        self.W = nn.Parameter(torch.randn(n_hidden, n_visible) * 0.1)
        self.v_bias = nn.Parameter(torch.zeros(n_visible))
        self.h_bias = nn.Parameter(torch.zeros(n_hidden))
        self.k = k  # CD-k steps

    def sample_h(self, v):
        # Given visible v, sample hidden h
        p_h = torch.sigmoid(F.linear(v, self.W, self.h_bias))  # p(h=1|v)
        h_sample = torch.bernoulli(p_h)                        # sample Bernoulli
        return p_h, h_sample

    def sample_v(self, h):
        # Given hidden h, sample visible v
        p_v = torch.sigmoid(F.linear(h, self.W.t(), self.v_bias))  # p(v=1|h)
        v_sample = torch.bernoulli(p_v)
        return p_v, v_sample

    def forward(self, v):
        # Perform k steps of contrastive divergence starting from v
        v_k = v.clone()
        for _ in range(self.k):
            _, h_k = self.sample_h(v_k)    # sample hidden from current visible
            _, v_k = self.sample_v(h_k)    # sample visible from hidden
        return v_k  # k-step reconstructed visible

    def free_energy(self, v):
        # Compute the visible bias term for each sample in the batch
        vbias_term = (v * self.v_bias).sum(dim=1)  # shape: [batch_size]
        # Compute the activation of the hidden units
        wx_b = F.linear(v, self.W, self.h_bias)     # shape: [batch_size, n_hidden]
        # Compute the hidden term
        hidden_term = torch.sum(torch.log1p(torch.exp(wx_b)), dim=1)  # shape: [batch_size]
        # Return the mean free energy over the batch
        return - (vbias_term + hidden_term).mean()
    
transform = transforms.Compose([transforms.ToTensor()])
train_dataset = datasets.FashionMNIST(root='./data', train=True, transform=transform, download=True)
test_dataset = datasets.FashionMNIST(root='./data', train=False, transform=transform, download=True)

def objective(trial):
    num_rbm_epochs = trial.suggest_int("num_rbm_epochs", 5, 5)# 24, 33)
    batch_size = trial.suggest_int("batch_size", 192, 1024)
    rbm_lr = trial.suggest_float("rbm_lr", 0.05, 0.1)
    rbm_hidden = trial.suggest_int("rbm_hidden", 384, 8192)

    mlflow.start_run(experiment_id=experiment.experiment_id)
    if CLASSIFIER != 'LogisticRegression':
        fnn_hidden = trial.suggest_int("fnn_hidden", 192, 384)
        fnn_lr = trial.suggest_float("fnn_lr", 0.0001, 0.0025)
        mlflow.log_param("fnn_hidden", fnn_hidden)
        mlflow.log_param("fnn_lr", fnn_lr)

    num_classifier_epochs = trial.suggest_int("num_classifier_epochs", 5, 5)# 40, 60)

    mlflow.log_param("num_rbm_epochs", num_rbm_epochs)
    mlflow.log_param("batch_size", batch_size)
    mlflow.log_param("rbm_lr", rbm_lr)
    mlflow.log_param("rbm_hidden", rbm_hidden)
    mlflow.log_param("num_classifier_epochs", num_classifier_epochs)

    # Instantiate RBM and optimizer
    device = torch.device("mps")
    rbm = RBM(n_visible=784, n_hidden=rbm_hidden, k=1).to(device)
    optimizer = torch.optim.SGD(rbm.parameters(), lr=rbm_lr)

    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

    rbm_training_failed = False
    # Training loop (assuming train_loader yields batches of images and labels)
    for epoch in range(num_rbm_epochs):
        total_loss = 0.0
        for images, _ in train_loader:
            # Flatten images and binarize
            v0 = images.view(-1, 784).to(rbm.W.device)      # shape [batch_size, 784]
            v0 = torch.bernoulli(v0)                        # sample binary input
            vk = rbm(v0)                                    # k-step CD reconstruction
            # Compute contrastive divergence loss (free energy difference)
            loss = rbm.free_energy(v0) - rbm.free_energy(vk)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"Epoch {epoch+1}: avg free-energy loss = {total_loss/len(train_loader):.4f}")
        if np.isnan(total_loss):
            rbm_training_failed = True
            break

    # After the epoch loop: bail out if RBM training diverged (NaN loss)
    if rbm_training_failed:
        accuracy = 0.0
        macro_f1 = 0.0
        print("RBM training failed - returning 0.0 for accuracy and macro F1")
        mlflow.log_metric("test_accuracy", accuracy)
        mlflow.log_metric("macro_f1", macro_f1)
        mlflow.set_tag("status", "rbm_failed")  # Optional tag
        mlflow.end_run()
        return float(accuracy)
    else:
        rbm.eval()  # set in evaluation mode if using any layers that behave differently in training
        features_list = []
        labels_list = []
        for images, labels in train_loader:
            # Use the normalized pixel values directly (could optionally binarize here)
            v = images.view(-1, 784).to(rbm.W.device)
            h_prob, h_sample = rbm.sample_h(v)  # get hidden activations
            features_list.append(h_prob.cpu().detach().numpy())
            labels_list.append(labels.numpy())
        train_features = np.concatenate(features_list)  # shape: [N_train, n_hidden]
        train_labels = np.concatenate(labels_list)

        # Convert pre-extracted training features and labels to tensors and create a DataLoader
        train_features_tensor = torch.tensor(train_features, dtype=torch.float32)
        train_labels_tensor = torch.tensor(train_labels, dtype=torch.long)
        train_feature_dataset = torch.utils.data.TensorDataset(train_features_tensor, train_labels_tensor)
        train_feature_loader = torch.utils.data.DataLoader(train_feature_dataset, batch_size=batch_size, shuffle=True)

            
        if CLASSIFIER == 'LogisticRegression':
            # add optuna tuning same as log reg without RBM features...
            lr_C = trial.suggest_float("lr_C", 0.01, 10.0, log=True)  
            mlflow.log_param("lr_C", lr_C)  # Log the chosen C value

            classifier = LogisticRegression(max_iter=num_classifier_epochs, C=lr_C, solver="saga") 
            classifier.fit(train_features, train_labels)            
            
        else:
            classifier = nn.Sequential(
                nn.Linear(rbm.n_hidden, fnn_hidden),
                nn.ReLU(),
                nn.Linear(fnn_hidden, 10)
            )

            # Move classifier to the same device as the RBM
            classifier = classifier.to(device)
            criterion = nn.CrossEntropyLoss()
            classifier_optimizer = torch.optim.Adam(classifier.parameters(), lr=fnn_lr)

            classifier.train()
            for epoch in range(num_classifier_epochs):
                running_loss = 0.0
                for features, labels in train_feature_loader:
                    features = features.to(device)
                    labels = labels.to(device)
                    
                    # Forward pass through classifier
                    outputs = classifier(features)
                    loss = criterion(outputs, labels)
                    
                    # Backpropagation and optimization
                    classifier_optimizer.zero_grad()
                    loss.backward()
                    classifier_optimizer.step()
                    
                    running_loss += loss.item()
                avg_loss = running_loss / len(train_feature_loader)
                print(f"Classifier Epoch {epoch+1}: loss = {avg_loss:.4f}")

        # Evaluate the classifier on test data.
        # Here we extract features from the RBM for each test image.
        if CLASSIFIER != 'LogisticRegression':
            classifier.eval()
            correct = 0
            total = 0
        features_list = []
        labels_list = []
        all_preds = []
        all_labels = []
        with torch.no_grad():
            for images, labels in test_loader:
                v = images.view(-1, 784).to(device)
                # Extract hidden activations; you can use either h_prob or h_sample.
                h_prob, _ = rbm.sample_h(v)
                if CLASSIFIER == 'LogisticRegression':
                    features_list.append(h_prob.cpu().detach().numpy())
                    labels_list.append(labels.numpy())
                else:
                    outputs = classifier(h_prob)
                    _, predicted = torch.max(outputs.data, 1)
                    total += labels.size(0)
                    correct += (predicted.cpu() == labels).sum().item()
                    all_preds.extend(predicted.cpu().numpy())
                    all_labels.extend(labels.numpy())

        if CLASSIFIER == 'LogisticRegression':
            test_features = np.concatenate(features_list)
            test_labels = np.concatenate(labels_list)
            predictions = classifier.predict(test_features)
            accuracy = accuracy_score(test_labels, predictions) * 100
            macro_f1 = f1_score(test_labels, predictions, average="macro")
        else:
            accuracy = 100 * correct / total
            # Predictions were collected in the loop above, so no second pass is needed.
            macro_f1 = f1_score(all_labels, all_preds, average="macro")
        
        print(f"Test Accuracy: {accuracy:.2f}%")
        print(f"Macro F1 Score: {macro_f1:.4f}") 
        
        mlflow.log_metric("test_accuracy", accuracy)
        mlflow.log_metric("macro_f1", macro_f1) 
        mlflow.end_run()
        return float(accuracy if accuracy is not None else 0.0)

if __name__ == "__main__":
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=1) # n_trials set to 1 for quick rendering
    print(study.best_params)
    print(study.best_value)
    print(study.best_trial)
Epoch 1: avg free-energy loss = 122.3159
Epoch 2: avg free-energy loss = 30.9675
Epoch 3: avg free-energy loss = 21.1254
Epoch 4: avg free-energy loss = 16.9566
Epoch 5: avg free-energy loss = 14.2925
Classifier Epoch 1: loss = 0.6824
Classifier Epoch 2: loss = 0.4721
Classifier Epoch 3: loss = 0.4332
Classifier Epoch 4: loss = 0.4067
Classifier Epoch 5: loss = 0.3927
Test Accuracy: 84.76%
Macro F1 Score: 0.8483
{'num_rbm_epochs': 5, 'batch_size': 711, 'rbm_lr': 0.06293811127059241, 'rbm_hidden': 3824, 'fnn_hidden': 239, 'fnn_lr': 0.0009420711551477983, 'num_classifier_epochs': 5}
84.76
FrozenTrial(number=0, state=1, values=[84.76], datetime_start=datetime.datetime(2025, 4, 4, 14, 7, 59, 511598), datetime_complete=datetime.datetime(2025, 4, 4, 14, 8, 26, 233977), params={'num_rbm_epochs': 5, 'batch_size': 711, 'rbm_lr': 0.06293811127059241, 'rbm_hidden': 3824, 'fnn_hidden': 239, 'fnn_lr': 0.0009420711551477983, 'num_classifier_epochs': 5}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'num_rbm_epochs': IntDistribution(high=5, log=False, low=5, step=1), 'batch_size': IntDistribution(high=1024, log=False, low=192, step=1), 'rbm_lr': FloatDistribution(high=0.1, log=False, low=0.05, step=None), 'rbm_hidden': IntDistribution(high=8192, log=False, low=384, step=1), 'fnn_hidden': IntDistribution(high=384, log=False, low=192, step=1), 'fnn_lr': FloatDistribution(high=0.0025, log=False, low=0.0001, step=None), 'num_classifier_epochs': IntDistribution(high=5, log=False, low=5, step=1)}, trial_id=0, value=None)

[I 2025-04-04 14:07:59,511] A new study created in memory with name: no-name-cab1e7ca-b217-456b-9357-c610a81bb262
[I 2025-04-04 14:08:26,234] Trial 0 finished with value: 84.76 and parameters: {'num_rbm_epochs': 5, 'batch_size': 711, 'rbm_lr': 0.06293811127059241, 'rbm_hidden': 3824, 'fnn_hidden': 239, 'fnn_lr': 0.0009420711551477983, 'num_classifier_epochs': 5}. Best is trial 0 with value: 84.76.

Test Accuracy by RBM Hidden Units

What the plot shows:
Highest accuracies cluster between 2000 and 4000 RBM hidden units, with an outlier at 3764 hidden units. This may suggest that too few hidden units lack the capacity needed to model the data, while too many hidden units cause some overfitting, resulting in poor generalization by the FNN classifier that receives the RBM hidden features.

Test Accuracy by FNN Hidden Units

What the plot shows:
Surprisingly, the number of hidden units in the FNN does not show a strong correlation with test accuracy; all values tested yield similar performance. This suggests the FNN learns sufficiently from the RBM features, and additional neurons do not significantly improve generalization.

| Model (Optuna Best Trial) | MLflow Test Accuracy (%) | Macro F1 Score |
|---|---|---|
| Logistic Regression | 84.71 | 0.846 |
| Feed Forward Network | 88.06 | 0.879 |
| Convolutional Neural Network | 91.29 | 0.913 |
| Logistic Regression (on RBM Hidden Features) | 87.14 | 0.871 |
| Feed Forward Network (on RBM Hidden Features) | 86.95 | 0.869 |

Conclusion

The CNN clearly outperforms the other models. Logistic regression, which typically performs well on binary classification tasks, underperforms on the multiclass Fashion-MNIST task; it improves when a Restricted Boltzmann Machine first extracts hidden features from the input data prior to classification. The feed-forward network, by contrast, is not improved by the RBM features. These findings illustrate the progress of machine and deep learning: more advanced neural networks trained on raw pixels can outperform models that use RBM hidden features.

Restricted Boltzmann Machines are no longer considered state-of-the-art for machine learning tasks. While contrastive divergence made training RBMs tractable, supervised training of deep feed-forward and convolutional networks with backpropagation proved more effective and came to dominate the field, largely because challenges with exploding and vanishing gradients were overcome through techniques such as batch normalization, dropout, and better weight initialization.

However, for the student of machine learning, RBMs remain valuable for understanding the foundations of unsupervised learning and energy-based models; modern generative models such as Stable Diffusion trace part of their lineage to score- and energy-based modeling. The mechanics of RBM training, such as Gibbs sampling, and the probabilistic nature of the model demonstrate how probability theory, Markov chains, and Boltzmann distributions are applied in machine learning.
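The block Gibbs sampling mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration with hypothetical, untrained parameters (`W`, `b`, and `c` below are assumptions for demonstration, not weights from the trained model earlier): alternately sample the hidden units given the visible units and vice versa, using the RBM's sigmoid conditionals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small RBM: 6 visible units, 4 hidden units (illustrative only).
n_visible, n_hidden = 6, 4
W = rng.normal(0, 0.1, size=(n_visible, n_hidden))  # weights
b = np.zeros(n_visible)                             # visible biases
c = np.zeros(n_hidden)                              # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v):
    """One step of block Gibbs sampling: sample h | v, then v | h."""
    p_h = sigmoid(v @ W + c)                          # P(h_j = 1 | v)
    h = (rng.random(n_hidden) < p_h).astype(float)    # sample hidden layer
    p_v = sigmoid(h @ W.T + b)                        # P(v_i = 1 | h)
    return (rng.random(n_visible) < p_v).astype(float)  # sample visible layer

# Run a short Markov chain from a random binary start; after enough steps,
# the samples approximate draws from the model distribution P(v).
v = (rng.random(n_visible) < 0.5).astype(float)
for _ in range(100):
    v = gibbs_step(v)
print(v)
```

In CD-k training (as in the `rbm(v0)` call in the code above), this chain is run for only k steps starting from a data vector rather than to convergence.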

References

Akiba, Takuya, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. “Optuna: A Next-Generation Hyperparameter Optimization Framework.” In The 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2623–31.
Aslan, Narin, Sengul Dogan, and Gonca Ozmen Koca. 2023. “Automated Classification of Brain Diseases Using the Restricted Boltzmann Machine and the Generative Adversarial Network.” Engineering Applications of Artificial Intelligence 126: 106794.
Fiore, Ugo, Francesco Palmieri, Aniello Castiglione, and Alfredo De Santis. 2013. “Network Anomaly Detection with the Restricted Boltzmann Machine.” Neurocomputing 122: 13–23.
Fischer, Asja, and Christian Igel. 2012. “An Introduction to Restricted Boltzmann Machines.” In Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications: 17th Iberoamerican Congress, CIARP 2012, Buenos Aires, Argentina, September 3-6, 2012. Proceedings 17, 14–36. Springer.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
Hinton, Geoffrey. 2010. “A Practical Guide to Training Restricted Boltzmann Machines.” Momentum 9 (1): 926.
Hinton, Geoffrey E. 2002. “Training Products of Experts by Minimizing Contrastive Divergence.” Neural Computation 14 (8): 1771–1800.
Melko, Roger G, Giuseppe Carleo, Juan Carrasquilla, and J Ignacio Cirac. 2019. “Restricted Boltzmann Machines in Quantum Physics.” Nature Physics 15 (9): 887–92.
Ning, Lin, Randall Pittman, and Xipeng Shen. 2018. “LCD: A Fast Contrastive Divergence Based Algorithm for Restricted Boltzmann Machine.” Neural Networks 108: 399–410.
O’Shea, Keiron, and Ryan Nash. 2015. “An Introduction to Convolutional Neural Networks.” https://arxiv.org/abs/1511.08458.
Oh, Sangchul, Abdelkader Baggag, and Hyunchul Nha. 2020. “Entropy, Free Energy, and Work of Restricted Boltzmann Machines.” Entropy 22 (5): 538.
Peng, Chao-Ying Joanne, Kuk Lida Lee, and Gary M Ingersoll. 2002. “An Introduction to Logistic Regression Analysis and Reporting.” The Journal of Educational Research 96 (1): 3–14.
Salakhutdinov, Ruslan, Andriy Mnih, and Geoffrey Hinton. 2007. “Restricted Boltzmann Machines for Collaborative Filtering.” In Proceedings of the 24th International Conference on Machine Learning, 791–98.
Sazlı, Murat H. 2006. “A Brief Review of Feed-Forward Neural Networks.” Communications Faculty of Sciences University of Ankara Series A2-A3 Physical Sciences and Engineering 50 (01).
Smolensky, Paul et al. 1986. “Information Processing in Dynamical Systems: Foundations of Harmony Theory.”
Xiao, Han, Kashif Rasul, and Roland Vollgraf. 2017. “Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms.” August 28, 2017. https://arxiv.org/abs/cs.LG/1708.07747.
Zaharia, Matei, Andrew Chen, Aaron Davidson, Ali Ghodsi, Sue Ann Hong, Andy Konwinski, Siddharth Murching, et al. 2018. “Accelerating the Machine Learning Lifecycle with MLflow.” IEEE Data Eng. Bull. 41 (4): 39–45.
Zhang, Nan, Shifei Ding, Jian Zhang, and Yu Xue. 2018. “An Overview on Restricted Boltzmann Machines.” Neurocomputing 275: 1186–99.